AfterWork


Clustering Analysis with Python 


Learning Outcomes 


By the end of this topic, you will have achieved the following learning outcomes: 


I can understand the concept of clustering analysis in unsupervised learning.

I can differentiate between different clustering analysis methods.

I can understand when to apply different clustering analysis methods.

I can apply the k-means clustering method in solving unsupervised learning
problems.

I can apply the hierarchical clustering method in solving unsupervised learning
problems.


“Predicting the future isn’t magic, it’s artificial intelligence.” ~Dave Waters 


Reading 


Clustering analysis is a type of analysis that allows us to find and analyze the groups that
have formed organically. This type of analysis groups data objects based only on
information found in the data that describes the objects and their relationships. The goal is
that the objects within a group be similar (or related) to one another and different from the
objects in other groups.


Clustering can be regarded as a form of classification in that it creates a labeling of objects
with class (cluster) labels. However, it derives these labels only from the data.


The greater the similarity within a group and the greater the difference between groups, the
better or more distinct the clustering. Cluster analysis may be the most accurate way of
determining defects.


Benefits of Cluster Analysis 


Clustering allows researchers to identify and define patterns between data elements. 
Revealing these patterns between data points helps to distinguish and outline 
structures which might not have been apparent before, but which give significant 
meaning to the data once they are discovered. 

Once a clearly defined structure emerges from the dataset at hand, informed 
decision-making becomes much easier. 


Applications 


Clustering is generally used in market research, pattern recognition, data analysis, and
image processing.

It helps marketers discover distinct groups in their customer base and characterize
those groups based on their purchasing patterns.

It helps identify areas of similar land use in an earth observation database, as well as
groups of houses in a city according to house type, value, and geographic location.

It is used to derive plant and animal taxonomies, categorize genes with similar
functionalities, and gain insight into structures inherent to populations.

It is used in outlier detection applications such as the detection of credit card fraud.


Fundamental Clustering Concepts 


While working with clustering, we need to take the following aspects into account:


• Scalability
o We need highly scalable clustering algorithms to deal with large datasets.
• Ability to deal with different kinds of attributes
o Algorithms should be capable of being applied to any kind of data, such as
interval-based (numerical), categorical, and binary data.
• High dimensionality
o The clustering algorithm should be able to handle not only low-dimensional
data but also high-dimensional spaces.
• Ability to deal with noisy data
o Databases contain noisy, missing, or erroneous data. Some algorithms are
sensitive to such data and may produce poor quality clusters.
• Interpretability
o The clustering results should be interpretable, comprehensible, and usable.


Types of Clustering 


1. Centroid Based Clustering 
• Centroid based clustering performs iterative grouping, in which the notion of
similarity is derived from the closeness of a data point to the centroid of the
clusters. For algorithms such as K-means, the number of clusters required at the
end has to be determined beforehand, which makes it important to have prior
knowledge of the dataset.


Examples of centroid based algorithms include K-Means Clustering and
Mean-Shift Clustering.


K Means Clustering 

K-means is a simple procedure for partitioning a given data set into a number of
clusters, denoted by the letter "k", which is fixed beforehand. The cluster centers are
first positioned as points, and every observation or data point is associated with the
nearest center. The centers are then recomputed and adjusted, and the process starts
over using the new centers until the result no longer changes.


The results of the K-means clustering algorithm are:
a. The centroids of the K clusters, which can be used to label new data. A
centroid is the point at the center of a cluster, computed as the mean of the
cluster's members; it need not be one of the original data points.
b. Labels for the training data (each data point is assigned to a single cluster).


Each centroid of a cluster is a collection of feature values which define the resulting 
groups. Examining the centroid feature weights can be used to qualitatively interpret 
what kind of group each cluster represents. 
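
The two outputs described above can be sketched in Python. This is a minimal illustration, assuming scikit-learn is installed (the handout does not name a library) and using a synthetic dataset from `make_blobs`; the data, the value k = 3, and the two new points are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points around 3 well-separated centers (illustrative only).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

# Fit K-means with k = 3; n_init=10 reruns the algorithm from 10 different starts.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

centroids = kmeans.cluster_centers_   # (a) one row of feature values per cluster
labels = kmeans.labels_               # (b) one cluster label per training point

# The centroids can label new data: each new point gets its nearest centroid.
new_points = np.array([[0.5, -0.5], [9.0, 11.0]])
new_labels = kmeans.predict(new_points)

print(centroids.round(2))
print(new_labels)
```

Reading each row of `centroids` feature by feature is one way to interpret what kind of group a cluster represents.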


• Step 1
o Select the number of clusters, k, you want to identify in your data.
• Step 2
o Randomly select k distinct data points as the initial cluster centers.
• Step 3
o Measure the distance from the 1st point to each of the initial cluster
centers (three in the source's example).
• Step 4
o Assign the 1st point to the nearest cluster.
o Repeat for all other points.
• Step 5
o Calculate the mean of each cluster and use it as the new cluster center,
then repeat steps 3 and 4 until the means no longer change.
Source: [Link]
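
The steps above can be sketched directly in NumPy. This is an illustrative, unoptimized sketch rather than a reference implementation: the function name `kmeans_sketch`, its parameters, and the synthetic data are all assumptions made for the example. Because the starting centers are random, a single run may settle on a poor grouping, which is why library implementations usually rerun from several starts.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain NumPy K-means following steps 1-5 (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 4: assign each point to its nearest center.
        labels = dists.argmin(axis=1)
        # Step 5: recompute each center as the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # the means are constant: stop
            break
        centers = new_centers
    return centers, labels

# Step 1: choose k. Here k = 3 on three synthetic groups of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [10, 10], [-10, 10])])
centers, labels = kmeans_sketch(X, k=3)
print(centers.round(1))
```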


You can refer to the following video [Link] for a simplified explanation of K-means 
clustering. 


Business Uses
• Behavioral segmentation
o Segment by purchase history; segment by activities on an application,
website, or platform; define personas based on interests; create
profiles based on activity monitoring.
• Inventory categorization
o Group inventory by sales activity; group inventory by manufacturing
metrics.
• Sorting sensor measurements
o Detect activity types in motion sensors; group images; separate
audio; identify groups in health monitoring.
• Detecting bots or anomalies
o Separate valid activity groups from bots; group valid activity to clean
up outlier detection.

Advantages
• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence, although possibly to a local rather than a global
optimum.


Disadvantages


• Performs poorly when the clusters are not roughly circular, as a result of using
the mean as the cluster center.


2. Hierarchical Clustering 


Hierarchical clustering creates a tree of clusters. When we allow clusters to have
subclusters, we obtain a hierarchical clustering, which is a set of nested clusters that
are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is
the union of its children (subclusters), and the root of the tree is the cluster
containing all the objects. Often, but not always, the leaves of the tree are singleton
clusters of individual data objects.


Hierarchical clustering algorithms fall into 2 categories:

• Divisive clustering (top-down). Top-down algorithms start with all data points
in a single cluster and successively split it into smaller clusters.
• Agglomerative clustering (bottom-up). Bottom-up algorithms treat each data
point as a single cluster at the outset and then successively merge (or
agglomerate) pairs of clusters until all clusters have been merged into a single
cluster that contains all data points.

In agglomerative clustering, which is the most widely used hierarchical clustering
approach, we repeatedly merge the most similar clusters. Each data point starts as
its own cluster, and the two closest clusters are then joined, using any of the
following methods for deciding the similarity between two clusters:


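• Single link: the distance between the two closest points of the two clusters.

• Complete link: the distance between the two farthest points of the two clusters.

• Average link: the average distance over all pairs of points, one from each cluster.

• Centroid distance: the distance between the centroids of the two clusters.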
Source: [Link]
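
The four linkage choices listed above can be tried side by side. The following is a minimal sketch, assuming SciPy is available (the handout does not name a library); the synthetic data and the choice to cut the tree into 3 clusters are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three small, well-separated synthetic groups of 20 points (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

results = {}
for method in ("single", "complete", "average", "centroid"):
    # Z records the sequence of bottom-up merges: n - 1 rows for n points.
    Z = linkage(X, method=method)
    # Cut the tree so that at most 3 flat clusters remain (labels start at 1).
    results[method] = fcluster(Z, t=3, criterion="maxclust")
    print(method, np.bincount(results[method])[1:])
```

On clearly separated data like this, the linkage choice should make little difference; the methods tend to disagree when clusters are elongated, unevenly sized, or close together.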


Advantages of Hierarchical Clustering 

• Easy to implement.

• No need to specify the number of clusters.

• Outputs a hierarchy that is more informative than unstructured clusters.

Limitations

• Not suitable for large datasets because of its computational complexity.


3. Density Based Clustering

Density-based clustering connects areas of high example density into clusters. This 
allows for arbitrary-shaped distributions as long as dense areas can be connected. 
These algorithms have difficulty with data of varying densities and high dimensions. 
Further, by design, these algorithms do not assign outliers to clusters. 


An example of a density based clustering algorithm is DBSCAN.
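
As a hedged illustration of arbitrary-shaped clusters, the sketch below runs DBSCAN and K-means on two interleaved half-circles. It assumes scikit-learn is installed; the dataset (`make_moons`) and the parameter values `eps=0.2` and `min_samples=5` are choices made for this example, not values from the handout, and they would need tuning on real data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: dense and connected, but not circular.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the density threshold.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN marks outliers with the label -1 instead of forcing them into a cluster.
n_clusters = len(set(db_labels) - {-1})
n_noise = int((db_labels == -1).sum())

# Adjusted Rand index: agreement with the true grouping (1.0 is a perfect match).
ari_dbscan = adjusted_rand_score(y_true, db_labels)
ari_kmeans = adjusted_rand_score(y_true, km_labels)

print("DBSCAN clusters:", n_clusters, "noise points:", n_noise)
print("ARI DBSCAN :", round(ari_dbscan, 3))
print("ARI K-means:", round(ari_kmeans, 3))
```

The comparison also illustrates the K-means limitation noted earlier: because each cluster is summarized by its mean, K-means tends to split the half-circles with a straight boundary.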


4. Distribution Based Clustering

This clustering approach assumes data is composed of distributions, such as
Gaussian distributions. In the source's illustration, a distribution-based algorithm
clusters the data into three Gaussian distributions.


As distance from the distribution's center increases, the probability that a point
belongs to the distribution decreases. The bands in the illustration show that
decrease in probability. When you do not know the type of distribution in your data,
you should use a different algorithm.





Source: [Link] 
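
A Gaussian mixture model is one common distribution-based method. The sketch below is an assumption-laden illustration using scikit-learn's `GaussianMixture` (the handout does not name an algorithm or library); the three synthetic Gaussians and the two probe points are invented for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from three Gaussians (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 2))
               for c in ([0, 0], [8, 8], [-8, 8])])

# Fit a mixture of three Gaussian components.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
soft_labels = gmm.predict_proba(X)  # membership probability per component

# One probe near a generating center, one halfway between two centers.
probe = np.array([[0.0, 0.0], [4.0, 4.0]])
probe_probs = gmm.predict_proba(probe)
print(probe_probs.round(3))
```

Unlike K-means labels, the rows of `soft_labels` are probabilities that sum to 1, so a point lying between two distributions can be reported as uncertain rather than forced into one group.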


Some factors to take into consideration while performing clustering include: 
• Feature engineering: using domain knowledge to choose which data
metrics to input as features into a machine learning algorithm. Feature
engineering plays a key role in K-means clustering; using meaningful features
that capture the variability of the data is essential for the algorithm to find all
of the naturally occurring groups.
• Categorical data (i.e., category labels such as gender, country, or browser
type) needs to be encoded or separated in a way that can still work with the
algorithm.
• Feature transformations, particularly to represent rates rather than
measurements, can help to normalize the data.
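
These preparation steps might look as follows in Python. This is only a sketch: the customer table, its column names, and its values are hypothetical, and the use of one-hot encoding and `StandardScaler` is one reasonable choice among several rather than something the handout prescribes. It assumes pandas and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical customer table; columns and values are invented for illustration.
df = pd.DataFrame({
    "purchases": [5, 20, 3, 40],
    "visits": [10, 25, 30, 50],
    "browser": ["chrome", "firefox", "chrome", "safari"],
})

# Feature transformation: a rate (purchases per visit) instead of a raw count.
df["purchase_rate"] = df["purchases"] / df["visits"]

# Categorical data: one-hot encode the browser label into 0/1 columns.
encoded = pd.get_dummies(df[["browser"]], dtype=float)

# Scale numeric features so no single feature dominates the distance computation.
numeric = StandardScaler().fit_transform(df[["visits", "purchase_rate"]])

# Final feature matrix that a clustering algorithm can consume.
features = np.hstack([numeric, encoded.to_numpy()])
print(features.shape)
```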


Interpret Results and Adjust Clustering 


a. Quality of Clustering
Check the quality of the clustering using the commonly used metrics described
in the following sections:
• Cluster Cardinality
o Cluster cardinality is the number of examples per cluster. Plot the
cluster cardinality for all clusters and investigate clusters that are
major outliers.
b. Cluster Magnitude


Cluster magnitude is the sum of distances from all examples to the centroid
of the cluster. Similar to cardinality, check how the magnitude varies across
the clusters, and investigate anomalies. For example, in the plot from the
source, investigate cluster number 0.
Source: [Link]
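
Both quantities can be computed from a fitted model. The sketch below assumes scikit-learn and uses synthetic data with deliberately unequal group sizes; the data and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with deliberately unequal group sizes (illustrative only).
X, _ = make_blobs(n_samples=[200, 50, 50],
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cardinality: number of examples assigned to each cluster.
cardinality = np.bincount(kmeans.labels_, minlength=3)

# Magnitude: sum of distances from each example to its own cluster's centroid.
# transform() returns the distance from every point to every centroid.
dist_to_own = kmeans.transform(X)[np.arange(len(X)), kmeans.labels_]
magnitude = np.bincount(kmeans.labels_, weights=dist_to_own, minlength=3)

for j in range(3):
    print(f"cluster {j}: cardinality={cardinality[j]}, "
          f"magnitude={magnitude[j]:.1f}")
```

A cluster whose cardinality or magnitude is far out of line with the others is a candidate for closer inspection.
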
c. Finding the Optimum Number of Clusters
• K-means clustering
o Use of the Elbow method
o Use of the Silhouette method
• Hierarchical clustering
o Method of Mojena
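
For K-means, the elbow and silhouette methods can be sketched as a loop over candidate values of k. This is a minimal illustration assuming scikit-learn; the synthetic data (generated with 4 groups) and the candidate range 2 to 8 are choices made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.8, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: within-cluster sum of squared distances for each k.
    inertias[k] = km.inertia_
    # Silhouette method: mean silhouette coefficient (higher is better).
    silhouettes[k] = silhouette_score(X, km.labels_)
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))

best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

For the elbow method, plot `inertias` against k and look for the point where the curve stops dropping sharply; for the silhouette method, prefer the k with the highest average score.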




